Working with data sets

Author

Andreas Blombach

Published

August 1, 2024

library(tidyverse)
library(readxl)
library(data.table)
library(corpora)
library(psych)
library(skimr)

Data sets

We can import data in a number of ways. R generally prefers CSV files, but there are packages to read in other file formats (Excel, SPSS, JSON, etc.).

We cannot go into all of the details – but here is a whole course on the topic if you want to learn more: https://learn.datacamp.com/courses/importing-data-in-r-part-1

Reading in data

Those with some R experience probably already know read.table(), read.csv(), read.csv2() etc. (Note that, as of R version 4.0, the parameter stringsAsFactors is set to FALSE by default.)

Alternatively, you can read in a data set as a tibble using the readr functions read_csv(), read_csv2() etc. (part of the tidyverse), which are also a little faster. For really large files, the data.table package offers the function fread().

Which function is best suited to read in a specific file depends on the file format and the file’s formatting (field separator, decimal point, etc.). The exception is fread(), which doesn’t care and usually figures this out by itself.
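Since fread() was just mentioned: it even accepts literal data passed as a string (handy for quick experiments) and detects the field separator on its own. The values here are made up for illustration:

```r
library(data.table)

# fread() detects the ";" separator automatically;
# toy data, not one of the corpus files used below
dt <- fread("Lemma;Freq\nLeben;3761\nBlog;2570")
dt
```
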

For CSV files in European format (semicolon as field separator, comma as decimal point), use read_csv2():

gen_blogs <- read_csv2("data/Genitive_DWDS_Blogs.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 17512 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (1): Lemma
dbl (2): s.Genitiv, es.Genitiv

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We can now have a look at the data:

gen_blogs

You can also use str() to display its internal structure (or the structure of any R object, really) or glimpse() to get an overview of all the columns:

str(gen_blogs)
spc_tbl_ [17,512 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Lemma     : chr [1:17512] "Leben" "Blog" "Internet" "Artikel" ...
 $ s.Genitiv : num [1:17512] 3761 2570 1847 1757 1666 ...
 $ es.Genitiv: num [1:17512] 0 0 0 0 0 6 192 0 0 265 ...
 - attr(*, "spec")=
  .. cols(
  ..   Lemma = col_character(),
  ..   s.Genitiv = col_double(),
  ..   es.Genitiv = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
glimpse(gen_blogs)
Rows: 17,512
Columns: 3
$ Lemma      <chr> "Leben", "Blog", "Internet", "Artikel", "Erachten", "Monat"…
$ s.Genitiv  <dbl> 3761, 2570, 1847, 1757, 1666, 1562, 1479, 1463, 1260, 1241,…
$ es.Genitiv <dbl> 0, 0, 0, 0, 0, 6, 192, 0, 0, 265, 725, 0, 0, 74, 0, 0, 15, …

When you read in data, you can also specify data types for certain columns:

gen_blogs <- read_csv2("data/Genitive_DWDS_Blogs.csv",
                       col_types = "?ii")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
gen_blogs

The first argument is a file path. Since the folder “data” is located in my current working directory, I don’t need to specify the full/absolute path.

Alternatively, you can use file.choose() to select a file:
read_csv2(file.choose())

Try it out!

There’s also read_csv() for classic CSV files (comma as field separator, . as decimal point), read_tsv() for files with tab stops as field separators, and read_delim(), the parent function where you can specify everything yourself.
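If automatic detection fails, read_delim() lets you spell out every setting yourself. A small sketch using literal data wrapped in I() (a readr feature, so the chunk runs without a file; the column names are made up):

```r
library(readr)

# semicolon as field separator, comma as decimal point --
# equivalent to read_csv2(), but with everything explicit
d <- read_delim(I("Wort;Wert\nHaus;1,5\nBaum;2,25"),
                delim = ";",
                locale = locale(decimal_mark = ","),
                show_col_types = FALSE)
d$Wert
```
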

RStudio offers some options to read in files a little more comfortably: File -> Import Dataset

  • From Text (base)…: base R functions: read.table() etc.: data.frame
  • From Text (readr)…: tidyverse/readr style: tibble
  • From Excel…: Excel files (tibble)
  • From SPSS…: SPSS files (tibble)
  • From SAS…: SAS files (tibble)
  • From Stata…: Stata files (tibble)

Once you’ve selected suitable options to import your data and clicked “Import”, the corresponding R command appears on the console. You can then copy it to your script to speed up the process in the future.

Example: Opening an Excel file:

gen_blogs <- read_excel("data/Genitive_DWDS_Blogs.xlsx")
gen_blogs

So far, we’ve imported all data as tibbles. When you look at these, you can see each column’s data type directly (otherwise, use a function such as str() to display the structure of an object such as a data.frame).

The most common data type abbreviations are:

  • chr for character
  • fct for factor
  • int for integer
  • dbl for double
  • lgl for logical

For a full list, see: https://tibble.tidyverse.org/articles/types.html.
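You can see these abbreviations by building a small tibble yourself (toy values) – printing it shows the type under each column name:

```r
library(tibble)

# one column per common type: <chr>, <fct>, <int>, <dbl>, <lgl>
d_types <- tibble(a = "Text",
                  b = factor("Ebene"),
                  c = 1L,
                  d = 1.5,
                  e = TRUE)
d_types
```
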

If you’ve got your own data with you, now’s the time to try to open it!

Otherwise, there are a few variations of the same data available:

  • romane.tsv
  • romane.csv
  • romane2.csv
  • romane3.csv
  • romane4.csv

Can you read in all of them correctly?

romane <- read_tsv("data/romane.tsv",
                   col_types = cols(Genre = "f", Kategorie = "f"))
romane

Accessing parts of a data set

To access a column (usually a statistical variable), enter the data set’s name, followed by a dollar sign and the name of the column. This returns a vector of values (we use head() to avoid displaying all of them):

gen_blogs$Lemma |> head()
[1] "Leben"    "Blog"     "Internet" "Artikel"  "Erachten" "Monat"   
gen_blogs$s.Genitiv |> head()
[1] 3761 2570 1847 1757 1666 1562

The weird little operator |> here is called a pipe operator. It was introduced in R 4.1.0 after a similar operator (from the magrittr package), %>%, had already been commonly used in the Tidyverse for quite some time and had gained a lot of traction in the R community.

Both of these operators simply take the object to their left as the (first) input of the function to their right.

This generally makes code more readable – instead of using functions inside of functions inside of functions, just pipe the output to the next function etc.

The last line of code is therefore equivalent to:

head(gen_blogs$s.Genitiv)
[1] 3761 2570 1847 1757 1666 1562

There are some subtle differences between |> and %>%, but in most cases, they can be used interchangeably. The native pipe (|>) is a little faster, however.

Pressing Ctrl+Shift+M (Mac: Cmd+Shift+M) in RStudio inserts a pipe operator (you can select your preferred one in the options).
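To see why piping helps readability, compare a nested call with its piped equivalent (toy vector):

```r
x <- c(1, 4, 9, 16)

# nested: read from the inside out
round(mean(sqrt(x)), 1)  # 2.5

# piped: read from left to right
x |> sqrt() |> mean() |> round(1)  # 2.5
```
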

Just as with vectors, you can use square brackets to subset a data set. You just have to provide two values: row and column.

romane[2, 3] # second row, third column
romane[3, c(1, 3)] # third row, columns 1 and 3
romane[3,] # third row, all columns (don't forget the comma!)
romane[, 6] # sixth column

To select certain columns, select() is also useful:

romane |> select(Genre, Autor, Titel)
romane |> select(Genre:Type_token_ratio)
romane |> select(Type_token_ratio:last_col())

You can also rename variables:

romane |> select(Titel,
                  TTR = Type_token_ratio,
                  Avg_length = Average_token_length_syllables)

If you just want to rename a column while keeping all other columns, rename() might be more practical:

romane |> rename(TTR = Type_token_ratio)

select() is also useful to change the order of columns:

romane |> select(Titel, Autor, everything()) # everything(): helper function

Filtering data sets

You’ll often want to get parts of a data set not according to their position, but according to certain conditions which must be fulfilled. That’s what filter() is for (or the base R function subset()).

gen_blogs has 17512 rows – let’s keep only the lemmas that appear at least five times with at least one of the two genitive forms (an arbitrary choice):

gen_blogs <- gen_blogs |> filter(s.Genitiv >= 5 | es.Genitiv >= 5)
gen_blogs

If several conditions have to be fulfilled, they can be separated by commas:

gen_blogs |> filter(s.Genitiv >= 100, es.Genitiv >= 100)

Logical AND works the same way:

gen_blogs |> filter(s.Genitiv >= 100 & es.Genitiv >= 100)

There are some lemmas in gen_blogs that shouldn’t be in there.
Let’s throw them out by using %in%:

gen_blogs <- gen_blogs |>
  filter(
    !(Lemma %in% c("Äußer", "Inner", "Wichtiger", "Schlimmer",
                   "Besser", "Neu"))
  )

Try to …

  • select all rows in gen_blogs where s.Genitiv is exactly 100
  • select all rows in gen_blogs where es.Genitiv is between 100 and 200
  • select all rows in romane where the genre is sci-fi, the type-token ratio is greater than 0.35 and Honoré’s H is greater than 2900
  • select all rows in romane where the lexical density is smaller than 0.35 or greater than 0.45
  • select all rows in romane where the author’s name starts with an “A” (tip: str_detect())

Modifying data

Sometimes, you want to modify certain columns. Luckily, not only can you access a column using the dollar sign notation, but you can also assign new values this way.

Alternatively, you can use mutate(), a function that comes in especially handy if you want to modify several columns at once.
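A minimal sketch of both approaches on a toy tibble (the column names are made up):

```r
library(dplyr)

d <- tibble::tibble(a = c(1, 2), b = c(10, 20))

# dollar-sign assignment: modify one column in place
d$a <- d$a * 2

# mutate(): change or add several columns at once
d <- d |> mutate(b = b / 10,
                 total = a + b)
d
```
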

In our romane data set, Rarity specifies the fraction of nouns, adjectives and verbs which can also be found in the most common 5000 nouns, adjectives and verbs in a reference corpus (in this case, the DECOW16BX). But this means that a lower value actually signifies a higher rarity (and thus, higher complexity) whereas the other measures in the data set work the other way around (higher value -> higher complexity). Luckily, this is very easy to fix:

romane$Rarity <- 1 - romane$Rarity

More cleanup of gen_blogs, using string functions and regular expressions:

Words ending in -nis have been improperly lemmatised (-niss):

str_subset(gen_blogs$Lemma, "niss$")
 [1] "Bündniss"                  "Ereigniss"                
 [3] "Verhältniss"               "Ergebniss"                
 [5] "Verständniss"              "Aktionsbündniss"          
 [7] "Gedächtniss"               "Selbstverständniss"       
 [9] "Verzeichniss"              "Wahlergebniss"            
[11] "Gefängniss"                "Bedürfniss"               
[13] "Arbeitsverhältniss"        "Bekenntniss"              
[15] "Geheimniss"                "Wahlgeheimniss"           
[17] "Beschäftigungsverhältniss" "Geständniss"              
[19] "Missverständniss"          "Bankgeheimniss"           
[21] "Kapitalverhältniss"        "Erlebniss"                
[23] "Unverständniss"            "Briefgeheimniss"          
[25] "Presseerzeugniss"          "Fernmeldegeheimniss"      
[27] "Vertragsverhältniss"       "Einverständniss"          
[29] "Gleichniss"                "Inhaltsverzeichniss"      
[31] "Mietverhältniss"           "Arbeitsgedächtniss"       
[33] "Begräbniss"                "Jahrhundertereigniss"     
[35] "Textverständniss"          "Untersuchungsergebniss"   
[37] "Verhängniss"               "Ärgerniss"                
gen_blogs$Lemma <- str_replace(gen_blogs$Lemma, "niss$", "nis")

There are also a few lemmas with a non-alphabetic character at the end:

gen_blogs$Lemma <- str_replace(gen_blogs$Lemma, "[^[:alpha:]]$", "")

Adding columns

If you want to add a column to an existing data.frame, tibble or data.table, the vector needs to have the same length as the other columns.

There are quite a few ways to do this. The easiest one is probably this:

gen_blogs$Length <- str_length(gen_blogs$Lemma) # word length in characters
gen_blogs

Optional step: new column with the number of syllables

# install.packages("sylly")
# install.packages("sylly.de", repo="https://undocumeantit.github.io/repos/l10n")
library(sylly.de)
gen_blogs$Syllables <- hyphen_c(gen_blogs$Lemma, hyph.pattern = "de",
                                quiet = TRUE)

mutate() can be used to add several columns at once, to change existing columns, and to do calculations with columns:

gen_blogs <- gen_blogs |>
  mutate(Total = s.Genitiv + es.Genitiv,
         Frac_es = round(es.Genitiv / Total, 2))
gen_blogs

Sorting

Use arrange() to change the order of rows:

gen_blogs |> arrange(desc(es.Genitiv))

(desc() sorts in descending order.)

You can also sort by several columns:

gen_blogs |> arrange(Length, Lemma)
gen_blogs |>
  arrange(desc(Length), desc(s.Genitiv), desc(es.Genitiv))
romane |> arrange(Kategorie, Genre, Autor, Titel)

“Long” and “wide” format

There are two common formats for tabular data:

  • In “wide” format, each row represents one unit of observation (e.g. a person, a country, a text or a word). There is no redundancy, making it easy to read. In case of repeated measures (e.g. the same statistical variable at different points in time or under different conditions), there are multiple columns.
  • In “long” (or “narrow”) format, there are multiple rows for a single unit of observation, one for each condition or point in time. All measurements of the same statistical variable will be in a single column, while another column specifies the category (time, condition, …).

Many functions in R require the input data to be in a specific format (mostly “long”), so you should know how to switch between the two.
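To make the two formats concrete, here is a toy data set (invented reaction times for two people under two conditions), shown both ways:

```r
library(tibble)

# wide: one row per person, one column per condition
wide <- tribble(
  ~person, ~cond_A, ~cond_B,
  "P1",        450,     510,
  "P2",        430,     490
)

# long: one row per person-and-condition combination
long <- tribble(
  ~person, ~condition, ~rt,
  "P1",    "cond_A",   450,
  "P1",    "cond_B",   510,
  "P2",    "cond_A",   430,
  "P2",    "cond_B",   490
)
```
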

As a first example, we’ll use frequency data of selected nouns in the written and spoken parts of the British National Corpus (BNC; see ?BNCcomparison):

BNCcomparison |> as_tibble()

Since both written and spoken contain frequencies, we could put these in a single column (frequency), with another column denoting the modality. To transform from “wide” to “long” format, we can use the pivot_longer() function:

BNC_long <- BNCcomparison |>
  pivot_longer(cols = written:spoken,
               names_to = "modality",
               values_to = "frequency")
BNC_long

To transform back to “wide” format, use pivot_wider():

BNC_long |>
  pivot_wider(names_from = "modality",
              values_from = "frequency")

Can you do the same thing with gen_blogs?

There are lots of examples in the vignette – have a look!
vignette("pivot")

The following more complex code shows an example using the romane data. We want to know which genres are more or less complex according to different measures of lexical complexity. Ideally, we’d have a single plot containing all measures by genre. We could use boxplots – but before we can do that, we have two problems to solve:

  • Different measures are on very different scales, making visual comparison almost impossible when using the same y-axis.
  • The plotting function we want to use requires the values of all measures to be in a single column (“long” format).

Let’s do some piping!

First, we write a little helper function to compute z-scores (we could also skip this step and use the in-built function scale() instead):

zscores <- function(x) {
  (x - mean(x)) / sd(x)
}
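A quick sanity check on the helper (the definition is repeated so this chunk runs on its own): z-scores always have mean 0 and standard deviation 1.

```r
zscores <- function(x) {
  (x - mean(x)) / sd(x)
}

zscores(c(1, 2, 3))
#> [1] -1  0  1
```
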

Then, we use this function to mutate our data, before transforming it to long format (and making a factor out of the new Measure column):

romane_long <- romane |>
  mutate(Type_token_ratio = zscores(Type_token_ratio),
         Honore_H = zscores(Honore_H),
         MTLD = zscores(MTLD),
         Dispersion = zscores(Dispersion),
         Disparity = zscores(Disparity),
         Evenness = zscores(Evenness),
         Density = zscores(Density),
         Rarity = zscores(Rarity),
         Average_token_length_syllables = zscores(Average_token_length_syllables)) |>
  pivot_longer(cols = Type_token_ratio:Honore_H,
               values_to = "Value", names_to = "Measure") |>
  mutate(
    Measure = factor(
      Measure,
      levels = c("Average_token_length_syllables",
                 "Type_token_ratio",
                 "Honore_H",
                 "MTLD",
                 "Disparity",
                 "Dispersion",
                 "Evenness",
                 "Density",
                 "Rarity"),
      labels = c("Mean token length in syllables",
                 "Type-token ratio",
                 "Honoré's H",
                 "McCarthy and Jarvis' MTLD",
                 "Semantic Disparity",
                 "Dispersion",
                 "Evenness",
                 "Lexical density",
                 "Rarity")
    )
  )

romane_long

Finally, we can plot it:

romane_long |>
  ggplot(aes(x = Measure, y = Value, colour = Genre)) +
  geom_boxplot(outlier.alpha = .5) +
  theme(axis.text.x = element_text(angle = -45, hjust = 0)) +
  labs(y = "z-score", title = "Standardised measures by genre")

Summarising data

  • group_by() creates a grouped tibble
  • summarise() is then used for arbitrary operations (sums, means, standard deviations, …), which are performed by group

Typical descriptive statistics you may want to use:

  • n(): current group size
  • mean(): arithmetic mean
  • mean(trim = .1): trimmed mean (fraction of trim removed from both the lowest and the highest values)
  • median(): median
  • var(): (sample) variance (with Bessel’s correction)
  • sd(): (sample) standard deviation (with Bessel’s correction)
  • min(), max(): lowest and highest value in a vector
  • quantile(): sample quantiles (quartiles by default)
  • IQR(): interquartile range (the difference between upper and lower quartile)
  • Different packages offer functions for skew and kurtosis (though most of them actually compute excess kurtosis, not kurtosis proper), e.g. psych::skew() and psych::kurtosi().

gen_blogs |> group_by(Length) |> 
  summarise(Lemma_count = n(), s_genitives = sum(s.Genitiv), 
            es_genitives = sum(es.Genitiv))
gen_blogs |> group_by(Syllables) |> 
  summarise(Lemma_count = n(), s_genitives = sum(s.Genitiv), 
            es_genitives = sum(es.Genitiv))

Let’s see some summary statistics for one of the variables in romane:

romane |> group_by(Genre) |>
  summarise(n = n(),
            TTR_mean = mean(Type_token_ratio),
            TTR_median = median(Type_token_ratio),
            TTR_sd = sd(Type_token_ratio),
            TTR_IQR = IQR(Type_token_ratio),
            TTR_min = min(Type_token_ratio),
            TTR_max = max(Type_token_ratio),
            TTR_skew = skew(Type_token_ratio),
            TTR_excess = kurtosi(Type_token_ratio))

Lots of packages offer convenient summary functions. Here are just a few examples:

summary(romane) # base R function
      ID                        Genre              Kategorie  
 Length:269         Hochliteratur  :60   Hochliteratur  : 60  
 Class :character   Horror         :51   Schemaliteratur:209  
 Mode  :character   Krimi          :38                        
                    Liebesroman    :60                        
                    Science-Fiction:60                        
                                                              
    Autor              Titel           Type_token_ratio   Dispersion    
 Length:269         Length:269         Min.   :0.2458   Min.   :0.7879  
 Class :character   Class :character   1st Qu.:0.2859   1st Qu.:0.8058  
 Mode  :character   Mode  :character   Median :0.3004   Median :0.8107  
                                       Mean   :0.3045   Mean   :0.8128  
                                       3rd Qu.:0.3189   3rd Qu.:0.8199  
                                       Max.   :0.4199   Max.   :0.8485  
   Disparity         Evenness         Density           Rarity      
 Min.   :0.5449   Min.   :0.9144   Min.   :0.3383   Min.   :0.4805  
 1st Qu.:0.6062   1st Qu.:0.9309   1st Qu.:0.3879   1st Qu.:0.5893  
 Median :0.6283   Median :0.9338   Median :0.4173   Median :0.6201  
 Mean   :0.6347   Mean   :0.9339   Mean   :0.4136   Mean   :0.6193  
 3rd Qu.:0.6616   3rd Qu.:0.9370   3rd Qu.:0.4330   3rd Qu.:0.6475  
 Max.   :0.8094   Max.   :0.9472   Max.   :0.4852   Max.   :0.7340  
 Average_token_length_syllables      MTLD          Honore_H   
 Min.   :1.544                  Min.   :107.0   Min.   :2187  
 1st Qu.:1.630                  1st Qu.:187.4   1st Qu.:2454  
 Median :1.667                  Median :203.9   Median :2561  
 Mean   :1.697                  Mean   :205.6   Mean   :2629  
 3rd Qu.:1.752                  3rd Qu.:221.0   3rd Qu.:2705  
 Max.   :2.027                  Max.   :346.9   Max.   :3725  
psych::describe(romane) # psych; there's also describeBy() for groups
skim(romane) # skimr; works with group_by() and can be customised
Data summary
Name romane
Number of rows 269
Number of columns 14
_______________________
Column type frequency:
character 3
factor 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ID 0 1 12 12 0 269 0
Autor 0 1 9 23 0 115 0
Titel 0 1 5 94 0 269 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Genre 0 1 FALSE 5 Hoc: 60, Lie: 60, Sci: 60, Hor: 51
Kategorie 0 1 FALSE 2 Sch: 209, Hoc: 60

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Type_token_ratio 0 1 0.30 0.03 0.25 0.29 0.30 0.32 0.42 ▃▇▃▁▁
Dispersion 0 1 0.81 0.01 0.79 0.81 0.81 0.82 0.85 ▂▇▃▂▁
Disparity 0 1 0.63 0.04 0.54 0.61 0.63 0.66 0.81 ▃▇▃▁▁
Evenness 0 1 0.93 0.00 0.91 0.93 0.93 0.94 0.95 ▁▁▇▇▂
Density 0 1 0.41 0.03 0.34 0.39 0.42 0.43 0.49 ▁▇▇▇▂
Rarity 0 1 0.62 0.05 0.48 0.59 0.62 0.65 0.73 ▁▃▇▅▂
Average_token_length_syllables 0 1 1.70 0.09 1.54 1.63 1.67 1.75 2.03 ▇▇▃▂▁
MTLD 0 1 205.59 34.75 107.01 187.36 203.88 220.99 346.92 ▁▇▇▂▁
Honore_H 0 1 2628.62 263.70 2186.66 2453.95 2561.41 2705.48 3725.18 ▇▇▂▁▁
romane |>
  group_by(Genre) |>
  skim()
Data summary
Name group_by(romane, Genre)
Number of rows 269
Number of columns 14
_______________________
Column type frequency:
character 3
factor 1
numeric 9
________________________
Group variables Genre

Variable type: character

skim_variable Genre n_missing complete_rate min max empty n_unique whitespace
ID Hochliteratur 0 1 12 12 0 60 0
ID Horror 0 1 12 12 0 51 0
ID Krimi 0 1 12 12 0 38 0
ID Liebesroman 0 1 12 12 0 60 0
ID Science-Fiction 0 1 12 12 0 60 0
Autor Hochliteratur 0 1 9 23 0 58 0
Autor Horror 0 1 13 18 0 4 0
Autor Krimi 0 1 13 13 0 1 0
Autor Liebesroman 0 1 11 18 0 42 0
Autor Science-Fiction 0 1 10 21 0 10 0
Titel Hochliteratur 0 1 5 94 0 60 0
Titel Horror 0 1 9 36 0 51 0
Titel Krimi 0 1 8 31 0 38 0
Titel Liebesroman 0 1 17 41 0 60 0
Titel Science-Fiction 0 1 11 30 0 60 0

Variable type: factor

skim_variable Genre n_missing complete_rate ordered n_unique top_counts
Kategorie Hochliteratur 0 1 FALSE 1 Hoc: 60, Sch: 0
Kategorie Horror 0 1 FALSE 1 Sch: 51, Hoc: 0
Kategorie Krimi 0 1 FALSE 1 Sch: 38, Hoc: 0
Kategorie Liebesroman 0 1 FALSE 1 Sch: 60, Hoc: 0
Kategorie Science-Fiction 0 1 FALSE 1 Sch: 60, Hoc: 0

Variable type: numeric

skim_variable Genre n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Type_token_ratio Hochliteratur 0 1 0.32 0.04 0.25 0.30 0.33 0.35 0.42 ▃▅▇▃▁
Type_token_ratio Horror 0 1 0.29 0.02 0.27 0.28 0.30 0.30 0.35 ▅▃▇▁▁
Type_token_ratio Krimi 0 1 0.29 0.01 0.26 0.28 0.29 0.30 0.31 ▂▁▇▇▅
Type_token_ratio Liebesroman 0 1 0.29 0.02 0.25 0.28 0.29 0.30 0.33 ▁▇▇▇▂
Type_token_ratio Science-Fiction 0 1 0.32 0.03 0.27 0.30 0.31 0.34 0.38 ▃▇▃▃▁
Dispersion Hochliteratur 0 1 0.82 0.01 0.79 0.82 0.83 0.83 0.85 ▂▂▇▇▃
Dispersion Horror 0 1 0.81 0.01 0.79 0.81 0.81 0.81 0.83 ▂▅▇▂▁
Dispersion Krimi 0 1 0.81 0.01 0.79 0.81 0.81 0.81 0.82 ▁▁▂▇▃
Dispersion Liebesroman 0 1 0.80 0.01 0.79 0.80 0.81 0.81 0.82 ▂▅▇▇▂
Dispersion Science-Fiction 0 1 0.82 0.01 0.80 0.81 0.82 0.82 0.83 ▃▇▅▇▂
Disparity Hochliteratur 0 1 0.67 0.03 0.59 0.65 0.68 0.70 0.72 ▂▁▇▇▇
Disparity Horror 0 1 0.61 0.02 0.56 0.59 0.61 0.63 0.68 ▅▇▇▁▁
Disparity Krimi 0 1 0.63 0.03 0.58 0.62 0.64 0.65 0.68 ▂▅▇▆▂
Disparity Liebesroman 0 1 0.63 0.04 0.54 0.61 0.62 0.64 0.81 ▂▇▂▁▁
Disparity Science-Fiction 0 1 0.62 0.04 0.56 0.60 0.62 0.64 0.73 ▃▇▆▃▁
Evenness Hochliteratur 0 1 0.93 0.01 0.91 0.93 0.93 0.94 0.95 ▂▃▇▇▂
Evenness Horror 0 1 0.93 0.00 0.92 0.93 0.93 0.93 0.94 ▁▃▇▃▁
Evenness Krimi 0 1 0.93 0.00 0.93 0.93 0.93 0.94 0.94 ▃▇▃▃▂
Evenness Liebesroman 0 1 0.93 0.00 0.92 0.93 0.93 0.94 0.94 ▁▃▇▇▁
Evenness Science-Fiction 0 1 0.94 0.00 0.93 0.93 0.94 0.94 0.95 ▅▇▆▆▂
Density Hochliteratur 0 1 0.41 0.03 0.34 0.39 0.41 0.43 0.46 ▂▃▇▆▅
Density Horror 0 1 0.42 0.03 0.37 0.38 0.42 0.43 0.46 ▆▁▃▇▃
Density Krimi 0 1 0.41 0.02 0.38 0.40 0.41 0.43 0.44 ▆▇▃▆▇
Density Liebesroman 0 1 0.39 0.02 0.36 0.38 0.39 0.40 0.42 ▃▇▅▆▃
Density Science-Fiction 0 1 0.45 0.02 0.40 0.43 0.44 0.46 0.49 ▂▇▇▅▂
Rarity Hochliteratur 0 1 0.61 0.06 0.48 0.58 0.62 0.65 0.73 ▂▅▇▅▃
Rarity Horror 0 1 0.63 0.03 0.57 0.61 0.63 0.64 0.71 ▅▆▇▂▁
Rarity Krimi 0 1 0.59 0.03 0.51 0.59 0.60 0.62 0.65 ▁▂▂▇▃
Rarity Liebesroman 0 1 0.59 0.03 0.50 0.57 0.59 0.61 0.65 ▁▂▇▇▂
Rarity Science-Fiction 0 1 0.67 0.03 0.60 0.65 0.66 0.70 0.73 ▁▆▇▃▅
Average_token_length_syllables Hochliteratur 0 1 1.69 0.07 1.54 1.64 1.70 1.72 1.83 ▂▇▇▅▂
Average_token_length_syllables Horror 0 1 1.63 0.04 1.57 1.60 1.63 1.65 1.79 ▅▇▂▁▁
Average_token_length_syllables Krimi 0 1 1.64 0.04 1.59 1.61 1.64 1.65 1.76 ▆▇▁▁▂
Average_token_length_syllables Liebesroman 0 1 1.66 0.04 1.55 1.63 1.66 1.70 1.76 ▁▇▆▇▂
Average_token_length_syllables Science-Fiction 0 1 1.83 0.07 1.71 1.79 1.82 1.87 2.03 ▃▇▅▂▁
MTLD Hochliteratur 0 1 186.32 47.16 107.01 153.17 185.44 211.27 295.81 ▅▅▇▂▂
MTLD Horror 0 1 203.69 15.50 176.89 191.90 203.88 213.40 241.53 ▆▆▇▅▁
MTLD Krimi 0 1 199.56 18.50 168.85 186.00 194.60 210.57 254.07 ▆▇▃▃▁
MTLD Liebesroman 0 1 204.60 19.66 163.86 189.53 206.92 216.00 265.29 ▃▆▇▂▁
MTLD Science-Fiction 0 1 231.27 37.14 178.74 201.90 227.06 248.72 346.92 ▇▇▂▂▁
Honore_H Hochliteratur 0 1 2913.82 354.02 2241.32 2629.40 2903.46 3133.62 3725.18 ▃▇▇▃▂
Honore_H Horror 0 1 2546.50 128.74 2335.98 2442.36 2548.56 2610.74 3009.31 ▆▇▅▁▁
Honore_H Krimi 0 1 2449.41 73.96 2230.01 2420.46 2446.24 2486.39 2626.74 ▁▁▇▃▂
Honore_H Liebesroman 0 1 2505.85 130.14 2186.66 2430.71 2501.20 2573.43 2882.01 ▁▇▇▃▁
Honore_H Science-Fiction 0 1 2649.50 176.44 2403.01 2501.29 2603.43 2752.55 3071.16 ▇▆▅▃▂

Another grouped summary: does the lemma end in s, ß, z or x?

gen_blogs$Ends_in_s <- factor(
  ifelse(str_sub(gen_blogs$Lemma, start = -1) %in% c("s", "ß", "z", "x"),
         "yes", "no")
)
gen_blogs
gen_blogs |> group_by(Ends_in_s) |>
  summarise(s = sum(s.Genitiv), es = sum(es.Genitiv))

Handling missing data

Missing data should always be NA in R. Pay special attention to character vectors – empty strings should often be treated as NA as well.

Many functions will return NA (or throw errors) when they encounter missing values. Luckily, many of them (e.g. mean(), sd()) also have the optional argument na.rm, which you can set to TRUE so that they ignore any missing data.

The function na.omit() will drop missing values from a vector; it will drop all rows containing missing values from a matrix or data.frame. So think carefully before using it on whole data sets – you might throw away useful data. (The same goes for the Tidyverse function drop_na().)

The package tidyr (from the Tidyverse) also provides further useful functions like replace_na() (say you want values of 0 instead of NA for specific columns) or fill().
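The points above in one short sketch (toy data with missing values; the column names are made up):

```r
library(tidyr)

x <- c(3, NA, 5)
mean(x)                # NA -- the missing value propagates
mean(x, na.rm = TRUE)  # 4

d <- tibble::tibble(word = c("Haus", "", "Baum"),
                    freq = c(10, NA, 3))

# empty strings should usually become NA, too
d$word[d$word == ""] <- NA

# replace NA in a specific column with 0
d |> replace_na(list(freq = 0))

# or drop incomplete rows entirely (use with care!)
na.omit(d)
```
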